Daniel Kim, Farzaan Hussain, Nishaanth Joopelli

Data Science Lab: Lab 2

Download data from Canvas/Files/Datasets/Lab2Data.zip

Submit:

  1. A pdf of your notebook with solutions.
  2. A link to your Colab notebook, or an upload of your .ipynb file if you are not working on Colab.

Goals of this Lab

  1. Continue reviewing important tools from probability and linear algebra, linking them to why we need them for data science and how to use them in Python.
  2. Specifically: the relationship between the covariance matrix and how skewed the data points are, and how to skew and unskew data.
  3. This also starts giving us ideas on how to explore our data and think about pre-processing it.
In [1]:
# Some useful libraries
import numpy as np
import scipy as sp
import scipy.linalg, scipy.stats  # explicitly load the scipy subpackages used below
import math

#Pandas for data structure and analysis tools
import pandas as pd

#seaborn and matplotlib for plotting
import seaborn as sns
import matplotlib.pyplot as plt

#for nice vector graphics
%matplotlib inline

# Use the maintained matplotlib-inline API (IPython's own
# set_matplotlib_formats is deprecated since IPython 7.23)
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

from numpy.random import default_rng
rng = default_rng(42) # Fixed seed for reproducibility; note that np.random.seed
                      # does not affect a default_rng() Generator

Problem 1

Recall from the previous lab that the eigenvectors of the covariance matrix give the directions of the major and minor axes of the ellipse of the data points, and the eigenvalues give the variance along each of those axes.
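As a quick sanity check of this claim (not part of the assignment), we can sample from a covariance matrix with known eigenvalues and confirm that the eigenvalues of the empirical covariance recover them; the matrix below is a made-up example whose eigenvalues are 1 and 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up covariance with known eigenvalues 1 and 4: the sample ellipse
# has axes along the eigenvectors, with variances 1 and 4 along them
cov = np.array([[2.5, 1.5], [1.5, 2.5]])
X = rng.multivariate_normal([0, 0], cov, size=20000)

# Eigendecomposition of the empirical covariance recovers the axes;
# eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
print(np.round(eigvals, 2))
```

With 20,000 samples the empirical eigenvalues land within a few percent of the true values 1 and 4.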

Problem 1, Part 1.

The code below provides another illustration of this. Generate a few different plots, with different rotations of the covariance matrix. All of the code is provided below; you just need to understand it and do some cut-and-paste to run it a couple of times.

In [2]:
def plot_points(mu, cov, samples, subplot_index, theta=None):
    # Get the eigenvectors of the covariance matrix
    L, V = np.linalg.eigh(cov)

    # Generate multivariate normal samples
    X = rng.multivariate_normal(mu, cov, samples)

    # Select the subplot at the given index (the caller creates the figure,
    # so all calls draw into one 2x3 grid)
    plt.subplot(2, 3, subplot_index)

    # Plot the eigenvectors (directions) and the scatter points.
    # The columns of V are the eigenvectors, so the rows of V supply the
    # x- and y-components of the two arrows.
    origin = np.array([[0, 0], [0, 0]]) # both arrows start at the origin
    plt.quiver(*origin, V[0, :], V[1, :], color=['r', 'b'], scale=1.5)
    plt.scatter(X[:, 0], X[:, 1])

    # Set the aspect ratio to 1
    axes = plt.gca()
    axes.set_aspect(1)

    # Adds title to graph with rotation and theta value
    if theta is not None:
        plt.title(f"Rotation: {theta:.2f} rad")
    else:
        plt.title("No Rotation")

# Number of samples
samples = 1000
mu = [0, 0]  # Mean at the origin (0, 0)
cov = [[1, 0.9], [0.9, 1]]  # Covariance matrix

# Create one figure to hold the 2x3 grid of subplots
plt.figure(figsize=(12, 8))

# Plot the first plot without any rotation
plot_points(mu, cov, samples, 1) # Plots on subplot 1

# Plots with different rotations of covariance matrix
thetas = [-0.15 * math.pi, -0.45 * math.pi, -0.05 * math.pi, 0.3 * math.pi]  # Different angles in radians
# Uses enumerate to iterate through thetas list and count up starting from 2
for i, theta in enumerate(thetas, 2):
    R = np.array([[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]])
    cov2 = np.dot(np.dot(R, cov), R.T) # Rotated covariance matrix

    # Plot rotated distribution
    plot_points(mu, cov2, samples, i, theta=theta)

# Adjust layout and show the plot
plt.tight_layout()
plt.show()
[Figure: scatter plots of the Gaussian samples at each rotation, with eigenvector arrows]

Problem 1 Part 2:

Generate 1,000 points from the original covariance matrix given above, call this $\Sigma$. Form the matrix:

$$ S = (\Sigma^{1/2})^{-1}, $$ i.e., the inverse matrix square root. You can compute this directly using the scipy command I've provided below.

Challenge question (optional): try to do this using only the numpy function np.linalg.eigh

Multiply each of the 1,000 points on the left by this matrix $S$, and plot them. If you do this right, you should get a cloud of points that looks spherical -- i.e., no rotation, no elongation.
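For the optional challenge, one way to build $S$ from np.linalg.eigh alone uses the spectral formula $\Sigma^{-1/2} = V \,\mathrm{diag}(1/\sqrt{\lambda_i})\, V^\top$, where $\Sigma = V \,\mathrm{diag}(\lambda_i)\, V^\top$. A minimal sketch with the same $\Sigma$ as above:

```python
import numpy as np

# Same Sigma as in the problem
cov = np.array([[1.0, 0.9], [0.9, 1.0]])

# Sigma = V diag(lam) V^T, so (Sigma^(1/2))^(-1) = V diag(1/sqrt(lam)) V^T
lam, V = np.linalg.eigh(cov)
S = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

# Sanity check: S Sigma S^T should be (numerically) the identity
print(np.round(S @ cov @ S.T, 6))
```

This agrees with `sp.linalg.inv(sp.linalg.sqrtm(cov))` up to floating-point error, since the covariance matrix is symmetric positive definite.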

In [3]:
# Generate 1000 points from the original covariance matrix.
samples = 1000
mu = [0, 0]  # Mean at the origin (0, 0)
cov = [[1, 0.9], [0.9, 1]]  # Covariance matrix

#(1000,2)
X = rng.multivariate_normal(mu, cov, samples)

# Now multiply each point on the left by the matrix
# that is the inverse matrix-square-root
#(2,2)
S = sp.linalg.inv(sp.linalg.sqrtm(cov))

# Left-multiply each point by S; transpose so the shapes line up:
# (2,2) x (1000,2).T => (2,2) x (2,1000), then .T => (1000,2)
X = np.dot(S, X.T).T

# Plot transformed points
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1])
plt.title("Transformed Points")
plt.show()
[Figure: roughly spherical cloud of transformed points]

Problem 2: Two Uses of the Above¶

Problem 2 Part 1

Load the data X_aniso.npy, and plot it. You will see that it looks like a single elongated cloud of points.

In [5]:
X = np.load('X_aniso.npy')

# Create a scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
plt.title('Scatter plot of X_aniso data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
[Figure: single elongated cloud of X_aniso points]

Problem 2, Part 2

Compute a transformation, i.e., a 2-by-2 matrix, to unskew the data. Left-multiply each point by this matrix, and plot the transformed data.

If you do this correctly, you will clearly be able to see that the data in fact form 2 clusters, not just 1.

Aside: clustering algorithms like K-Means and Spectral Clustering will fail on the skewed data, but will succeed in identifying the clusters once the data have been unskewed.

In [6]:
# Whitening transform built from the eigendecomposition of the covariance
cmatrix = np.cov(X.T)
eigenvalues, eigenvectors = np.linalg.eigh(cmatrix)  # eigh: covariance is symmetric
tmatrix = eigenvectors.dot(np.diag(1.0 / np.sqrt(eigenvalues)))
X = X.dot(tmatrix)
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
plt.show()
[Figure: two distinct clusters visible after unskewing]
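A small self-contained sketch of why this unskewing step matters as pre-processing: on synthetic two-cluster data (made up for illustration, not the X_aniso data), the whitening transform built from the pooled covariance drives that covariance to the identity, so a method like K-Means measures distances on equal footing in every direction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two made-up clusters sharing a strongly skewed covariance
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
cluster_a = rng.multivariate_normal([-1.0, 1.0], cov, 500)
cluster_b = rng.multivariate_normal([1.0, -1.0], cov, 500)
X = np.vstack([cluster_a, cluster_b])

# Whitening transform from the pooled covariance
lam, V = np.linalg.eigh(np.cov(X.T))
S = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T
Xw = (S @ X.T).T

# The whitened data has identity covariance, so no direction dominates
# the distance computations a clustering algorithm performs
print(np.round(np.cov(Xw.T), 3))
```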

Problem 2, Part 3

Load the data in DF2, and plot it. Note that it seems that there are two points that are "outliers." We want to understand which point is more of an outlier.

In [7]:
DF2 = pd.read_csv('DF2')


plt.figure(figsize=(10, 8))
plt.scatter(DF2.iloc[:, 0], DF2.iloc[:, 1], alpha=0.5)
plt.grid(True)
plt.show()
[Figure: scatter of DF2 with two apparent outliers]

Problem 2, Part 4

Unskew the data in a similar way, and plot the result again. If you do this correctly, you will see that one of the points is clearly far more of an outlier.

We can now compare the points, because we have transformed the data to have identity covariance. That means the variance in each direction is 1, so we can directly see that the point northwest of the mass of points is more standard deviations away from the bulk than the other point.

In [8]:
cmatrix = np.cov(DF2.T)
eigenvalues, eigenvectors = np.linalg.eigh(cmatrix)  # eigh: covariance is symmetric
tmatrix = eigenvectors.dot(np.diag(1.0 / np.sqrt(eigenvalues)))
DF2 = DF2.dot(tmatrix)
plt.figure(figsize=(10, 8))
plt.scatter(DF2.iloc[:, 0], DF2.iloc[:, 1], alpha=0.5)
plt.show()
[Figure: unskewed DF2; the northwest point is clearly the stronger outlier]
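What this comparison computes is, in effect, the Mahalanobis distance of each candidate from the bulk of the data: Euclidean distance after unskewing. A sketch on synthetic data (the cloud and the two candidate points below are hypothetical stand-ins, not the actual DF2 values):

```python
import numpy as np

rng = np.random.default_rng(2)

# A made-up skewed cloud plus two hypothetical candidate outliers
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0, 0], cov, 1000)
p1 = np.array([3.0, 3.0])    # far away, but along the cloud's long axis
p2 = np.array([-1.5, 1.5])   # closer, but across the cloud's short axis

mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X.T))

def mahalanobis(p):
    # Distance in "standard deviations" of the cloud, i.e., after unskewing
    d = p - mu
    return float(np.sqrt(d @ inv_cov @ d))

print(mahalanobis(p1), mahalanobis(p2))
```

Here p1 is farther in plain Euclidean distance, yet p2 is farther in Mahalanobis distance, which is the lab's point: after unskewing, the point sitting across the cloud is the real outlier.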

Problem 3

Correlations.

  • When given a data matrix, an easy way to tell if any two columns are correlated is to look at a scatter plot of each column against each other column. For a warm up, do this: Look at the data in DF1 in Lab2.zip. Which columns are (pairwise) correlated? Figure out how to do this with Pandas, and also how to do this with Seaborn.

  • Compute the covariance matrix of the data. You can use any command you like for this (e.g., np.cov) to compute the 4×4 matrix. Explain why the numbers that you get fit with the plots you got.

In [9]:
# The first column of the DF1 file is a saved row index; read it in as the
# index so it is not treated as a fifth data column
DF1 = pd.read_csv('DF1', index_col=0)
pd.plotting.scatter_matrix(DF1, figsize=(10, 10), diagonal='kde')
plt.show()
sns.pairplot(DF1)
plt.show()
[Figures: Pandas scatter matrix and Seaborn pairplot of the four DF1 columns]

The correlated column pairs are the ones whose scatter plots collapse onto a nearly straight line: the first and third columns (strong positive correlation) and the second and fourth columns (strong negative correlation). The other pairs form round, unstructured clouds. Both the Pandas scatter matrix and the Seaborn pairplot show this.

In [10]:
cmatrix1 = np.cov(DF1.T)
print(cmatrix1)
[[ 1.00155793e+00 -4.01175564e-03  9.91624091e-01  4.12485289e-03]
 [-4.01175564e-03  1.00537841e+00 -4.09877334e-03 -9.95456622e-01]
 [ 9.91624091e-01 -4.09877334e-03  1.00158867e+00  4.08107943e-03]
 [ 4.12485289e-03 -9.95456622e-01  4.08107943e-03  1.00516828e+00]]

If we look at this covariance matrix, the diagonal entries are all close to 1, so the off-diagonal entries are approximately correlations. Entries close to 1 or -1 correspond exactly to the scatter plots whose clouds collapse onto straight lines, and entries near 0 correspond to the round, uncorrelated clouds. (The bell-curve panels on the diagonal are just each column's own density estimate, not evidence of correlation.)
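In general, np.corrcoef performs this normalization for you, dividing each covariance by the two standard deviations. A small synthetic example (columns a, b, c below are made up, not the DF1 data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: column b is nearly a copy of a, column c is independent
a = rng.normal(size=1000)
b = a + 0.05 * rng.normal(size=1000)
c = rng.normal(size=1000)
data = np.column_stack([a, b, c])

# np.corrcoef rescales the covariance matrix so every diagonal entry is 1,
# which makes the off-diagonal entries directly comparable
corr = np.corrcoef(data.T)
print(np.round(corr, 3))
```

The (a, b) entry comes out essentially 1, while the entries involving c hover near 0.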

Problem 4: Linear Regression

There are several main goals for this problem.

  1. This will help us remember that the solution to a regression problem is a random variable, since, after all, it is calculated from random data.
  2. Since it is a random variable, it has a mean and a variance. We will compute these analytically in class; here you will compute them empirically.
  3. This is very important: if we can compute or estimate the variance of the solution to a regression problem, then we can determine its statistical significance.

Problem 4, Part 1.

Generate $n = 150$ data points as follows: $x_i \sim N(0,1)$, $e_i \sim N(0,1)$, and $y_i = \beta_0^* + x_i \beta_1^* + e_i$, where $\beta_0^* = -3$ and $\beta_1^* = 0$. Note that since $\beta_1^* = 0$, this means that $y$ and $x$ are unrelated!

Use either the closed-form expression for $\hat{\beta}_1$ or a linear regression package of your choice to obtain the least-squares estimate $\hat{\beta}_1$. Is it equal to $\beta_1^*$? Either way, explain.

In [11]:
def generate_regression():
  n = 150
  x = np.random.normal(0, 1, n)
  e = np.random.normal(0, 1, n)

  beta_0 = -3
  beta_1 = 0

  y = beta_0 + x * beta_1 + e
  # linregress returns (slope, intercept, rvalue, pvalue, stderr);
  # the first value is the least-squares estimate of beta_1
  estimate, _, _, _, _ = sp.stats.linregress(x, y)

  return estimate

print(generate_regression())
-0.02583583275262477

The least-squares estimate $\hat{\beta}_1$ is not exactly equal to $\beta_1^* = 0$. However, the difference from 0 is small and not statistically significant, and is explained by sampling variability: with a finite sample, the noise $e_i$ is never exactly uncorrelated with $x_i$.
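For reference, the closed-form expression mentioned above is $\hat\beta_1 = \sum_i (x_i - \bar x)(y_i - \bar y) \,/\, \sum_i (x_i - \bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. A sketch on a fresh synthetic draw from the same model:

```python
import numpy as np

rng = np.random.default_rng(4)

# A fresh draw from the same model: beta_0* = -3, beta_1* = 0
n = 150
x = rng.normal(0, 1, n)
y = -3 + rng.normal(0, 1, n)

# Closed-form least-squares estimates
xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar
print(beta0_hat, beta1_hat)
```

This matches what a regression package returns for the same data, up to floating-point error.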

Problem 4, Part 2

Now repeat this experiment 1000 times, always using the same ground truth. You will now have 1000 different values for $\hat{\beta}_1$. Plot these values in a histogram (set the bin size so that it looks nice). Note that for each of these 1000 values of $\hat{\beta}_1$, you need to generate $n=150$ data points, as you did above.

If you do this right, you should have something that looks like a Gaussian. This is telling us that the distribution of the solution to a regression problem looks like a Gaussian. Therefore, we can use everything we know about Gaussians, standard deviations, etc., to judge statistical significance.

In [12]:
beta_1_estimates = [generate_regression() for _ in range(1000)]

plt.figure(figsize=(10, 6))
plt.hist(beta_1_estimates, bins=100)
plt.xlabel('Estimated β₁')
plt.ylabel('Frequency')
plt.show()
[Figure: histogram of the 1000 slope estimates, approximately Gaussian and centered at 0]
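As an optional check on the spread of this histogram: for noise variance $\sigma^2 = 1$, the standard deviation of $\hat\beta_1$ is roughly $1/\sqrt{n}$ (more precisely, $\sigma / \sqrt{\sum_i (x_i - \bar x)^2}$ for a given design). A Monte Carlo sketch under the same assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo estimate of sd(beta1_hat) under x ~ N(0,1), e ~ N(0,1)
n, trials = 150, 2000
slopes = np.empty(trials)
for t in range(trials):
    x = rng.normal(0, 1, n)
    y = -3 + rng.normal(0, 1, n)
    xc = x - x.mean()
    slopes[t] = np.sum(xc * (y - y.mean())) / np.sum(xc ** 2)

# Empirical spread vs. the 1/sqrt(n) rule of thumb
print(slopes.std(), 1 / np.sqrt(n))
```

So a slope estimate a few times $1/\sqrt{n} \approx 0.08$ away from zero would be statistically significant here; the estimate from Part 1 is well within one standard deviation of 0.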

Problem 5: Regression and Non-linear Fitting

In this exercise you'll explore two different approaches to fitting non-linear curves through data, but using linear regression (as we also did in class when considering the example of salary vs experience). Download the data in the "auto-mpg.data-original" data set from http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/.

  1. Display all pairwise relationships of the data. There are a number of packages that will do this for you. You can use, for example, Pandas, or you can use Seaborn, or something else of your choice.

  2. Now you will explore the relationship between mass and fuel efficiency. First, try to fit a straight line (not necessarily through the origin) to the data, using squared loss as your loss function. That is, try to find the best-fit relationship of the form: ${\rm MPG} = \beta_0 + \beta_1 {\rm WEIGHT}$.

  3. Now fit a quadratic, i.e., a relationship of the form: ${\rm MPG} = \beta_0 + \beta_1 {\rm WEIGHT} + \beta_2 {\rm WEIGHT}^2$. Do this in two different ways: (A) Use the numpy.polynomial package and ask it to fit a polynomial of degree 2. Plot the curve you get against the points. (B) Now repeat this, by explicitly adding a column that is the square of the weight, and then using linear (but multiple) regression.

In [13]:
# Part 1
# Create column names based on attributes
column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name" ]

# Read data, splitting on whitespace; the raw string r'\s+' avoids the
# "invalid escape sequence '\s'" SyntaxWarning
data = pd.read_csv("sample_data/auto+mpg/auto-mpg.data-original", sep=r'\s+', header=None, names=column_names)

#Display all the pairwise relationships of the data
sns.pairplot(data)
plt.show()

#Part 2
from sklearn import linear_model

# Drop rows with NaN values in 'weight' or 'mpg'
cleaned_data = data.dropna(subset=['weight', 'mpg'])

# Get Weight and  MPG columns from dataframe
X = cleaned_data["weight"].to_numpy()
Y = cleaned_data["mpg"].to_numpy()

# Find line of best fit using linear regression, which uses squared loss as its loss function
lr = linear_model.LinearRegression()
lr.fit(X.reshape(-1,1),Y)

# Gets beta values of best-fit line
beta_0 = lr.intercept_
beta_1 = lr.coef_[0]

# Plot points and line of best fit
plt.figure(figsize=(10, 6))
plt.scatter(X, Y, color='black', label='Data points')
plt.plot(X, lr.predict(X.reshape(-1, 1)), color='blue', label='Best fit')

# Part 3
# A
from numpy.polynomial import Polynomial

# Get quadratic fit using numpy.polynomial and plot
f = Polynomial.fit(X, Y, deg=2)

# Generate X values for the line and plot fitted quadratic
X_quad = np.linspace(X.min(), X.max(), 100)
Y_quad = f(X_quad)
plt.plot(X_quad, Y_quad, color='red', linestyle='-', linewidth=4, label='Quadratic Fit (numpy.polynomial)')

#B
# Column that is square of weight
X_squared = X ** 2

# Combine the two columns into a 2-D array for multiple regression
combined_x = np.column_stack([X, X_squared])

# Use linear(multiple) regression to find line of best fit
mr = linear_model.LinearRegression()
mr.fit(combined_x, Y)

# Generate evenly spaced weight values, square them, and plot the fitted quadratic
X_multi = np.linspace(X.min(), X.max(), 100)
X_multi_squared = X_multi ** 2
combined_x_multi = np.column_stack([X_multi, X_multi_squared])  # Combine columns for multiple regression
Y_multi = mr.predict(combined_x_multi)
plt.plot(X_multi, Y_multi, color='yellow', linestyle='--', label='Quadratic Fit (multiple regression)')

plt.legend()
plt.show()
[Figures: pairplot of the auto-mpg columns; MPG vs. weight with the linear and both quadratic fits]

Problem 6 (Optional)

Run through the Jupyter notebook from class on child IQ and mom IQ, from the textbook of Gelman and Hill (we will go over this in class this week). You will have to download the data set yourself. Then do the exercise of adding an interaction term, as left for you at the end of the notebook. Explain what you see and how it relates to the graph you obtained before adding the interaction term. That is, use plots/visualization to argue convincingly that the interaction term should or shouldn't be there, and then tell us what this means.
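For reference, an interaction term is just an extra design-matrix column holding the elementwise product of two predictors. A minimal synthetic sketch (the variable names mimic the kidiq example, but every number below is made up, not the actual Gelman and Hill data):

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data: outcome depends on a continuous predictor, a binary
# predictor, and their interaction (true coefficients 20, 0.5, 10, 0.3)
n = 500
mom_iq = rng.normal(100, 15, n)      # hypothetical continuous predictor
mom_hs = rng.integers(0, 2, n)       # hypothetical binary predictor
kid_score = (20 + 0.5 * mom_iq + 10 * mom_hs
             + 0.3 * mom_iq * mom_hs + rng.normal(0, 5, n))

# The interaction is just an extra column: the elementwise product
A = np.column_stack([np.ones(n), mom_iq, mom_hs, mom_iq * mom_hs])
coef, *_ = np.linalg.lstsq(A, kid_score, rcond=None)
print(np.round(coef, 2))
```

A nonzero interaction coefficient means the slope of the outcome against the continuous predictor differs between the two groups, which is exactly what the before/after plots should show.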

In [ ]: